-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-22460 : Reopen regions with very high Store Ref Counts #600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| for (ServerMetrics serverMetrics : serverMetricsMap.values()) { | ||
| Map<byte[], RegionMetrics> regionMetricsMap = serverMetrics.getRegionMetrics(); | ||
| for (RegionMetrics regionMetrics : regionMetricsMap.values()) { | ||
| final int regionStoreRefCount = regionMetrics.getStoreRefCount(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This count is the sum of ref count of all HFiles under each of the CF in this region right? So the sum is not just a factor of many refs on a file. It is a function of number of hfiles under this region. What is it is having so many files at a time and each of that having say <10 active refs. So 250 is obviously not a correct number IMO. Is there a way we can see refs per HFile? No need to know ref counts on each of the HFile. Max number may be enough. I believe this decision should be made based on the so many ref counts on a file rather than sum of. Thoughts?
|
|
||
| private List<HRegionLocation> prepareRegionsForReopen(MasterProcedureEnv env) { | ||
| List<HRegionLocation> regionsToReopenList = new ArrayList<>(); | ||
| List<HRegionLocation> tableRegionsForReopen = env.getAssignmentManager() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better pass this List to this private method than passing env.
| } | ||
|
|
||
| private List<HRegionLocation> prepareRegionsForReopen(MasterProcedureEnv env) { | ||
| List<HRegionLocation> regionsToReopenList = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can avoid the terms like List/Map from the variable name?
|
🎊 +1 overall
This message was automatically generated. |
| int getStoreRefCount(); | ||
|
|
||
| /** | ||
| * @return the max reference count for any store among all stores of this region |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the max ref count on a Store or on a StoreFile? The latter right? Pls change the javadoc and name of method accordingly. That will be very clear name then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And name to be corrected accordingly in all related places.
| storeFileCount += r.getStoreFileCount(); | ||
| int currentStoreRefCount = r.getStoreRefCount(); | ||
| storeRefCount += currentStoreRefCount; | ||
| maxStoreRefCount = Math.max(maxStoreRefCount, currentStoreRefCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RegionMetrics should give maxStoreFileRefCount also and that should be considered here.
| <name>hbase.regions.recovery.store.count</name> | ||
| <value>256</value> | ||
| <description> | ||
| Store Ref Count threshold value considered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pls correct this desc accordingly
|
|
||
| private static final String REGIONS_RECOVERY_INTERVAL = | ||
| "hbase.master.regions.recovery.interval"; | ||
| private static final String STORE_REF_COUNT_THRESHOLD = "hbase.regions.recovery.store.count"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is store file ref count for recovery. Pls change config name such a way to indicate this.
| "Error reopening regions with high storeRefCount. "; | ||
|
|
||
| private final HMaster hMaster; | ||
| private final int storeRefCountThreshold; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This var name too
| @Override | ||
| protected void chore() { | ||
| if (LOG.isTraceEnabled()) { | ||
| LOG.trace("Starting up Regions Recovery by reopening regions based on storeRefCount..."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Starting up Regions Recovery chore for reopening .......
| LOG.error("Error while reopening regions based on storeRefCount threshold", e); | ||
| } | ||
| if (LOG.isTraceEnabled()) { | ||
| LOG.trace("Exiting Regions Recovery by reopening regions based on storeRefCount..."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log to be corrected.
| // is beyond a threshold value, we should reopen the region. | ||
| // Here, we take max ref count of all stores and not the cumulative count | ||
| // of all stores. | ||
| final int maxStoreRefCount = regionMetrics.getMaxStoreRefCount(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All places maxStoreFileRefCount pls.
| // For each region, each store file can have different ref counts | ||
| // We need to find maximum of all such ref counts and if that max count | ||
| // is beyond a threshold value, we should reopen the region. | ||
| // Here, we take max ref count of all stores and not the cumulative count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
max count on store or on a store file? Pls make it consistent. The above line says it is store file and that is what ideally needed too. I doubt whether we are doing that. I think what we actually doing is passing the max store ref count (sum of ref counts of all store files under that store)
| // Here, we take max ref count of all stores and not the cumulative count | ||
| // of all stores. | ||
| final int maxStoreRefCount = regionMetrics.getMaxStoreRefCount(); | ||
| if (maxStoreRefCount > storeRefCountThreshold) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should consider the system table different? META table might some time get too large ref count. It is possible. May be many new clients are started/restarted or many regions move/split etc. What if at the time of reporting it had a huge ref count. Reopening of META will affect the whole system! Anyways the def count of 256 (of now) is way less for any kind of regions IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, better we exclude META.
Updated PR with all the suggestions, started using store file's max refCount
|
💔 -1 overall
This message was automatically generated. |
| storeCount += r.getStoreCount(); | ||
| storeFileCount += r.getStoreFileCount(); | ||
| storeRefCount += r.getStoreRefCount(); | ||
| maxStoreFileRefCount += r.getMaxStoreFileRefCount(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should do max() here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is addition of all counts for toString() right? I thought of doing max() but somehow I feel all these counts should be addition of all regionMetrics for the sake of toString() only. Still, I agree better to include max of all storeFileRefCount
| <description> | ||
| Regions Recovery Chore interval in milliseconds. | ||
| This chore keeps running at this interval to | ||
| find all regions with high store ref count and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Say above a configurable max value
| </description> | ||
| </property> | ||
| <property> | ||
| <name>hbase.master.regions.recovery.interval</name> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name it like hbase.master.regions.recovery.check.interval (?) Is that a better name?
| int maxStoreFileRefCount = 0; | ||
| Collection<HStoreFile> hStoreFiles = this.storeEngine.getStoreFileManager() | ||
| .getStorefiles(); | ||
| for (HStoreFile storeFile : hStoreFiles) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return this.storeEngine.getStoreFileManager().getStorefiles().stream()
.filter(sf -> sf.getReader() != null).filter(HStoreFile::isHFile)
.mapToInt(HStoreFile::getRefCount).max()
??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes same thing but with streams, so better
| </property> | ||
| <property> | ||
| <name>hbase.regions.recovery.store.file.count</name> | ||
| <value>128</value> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be a default of 1000 or so better.
Ya safe side leave META region from this as of now?
f58691f to
86a1644
Compare
|
💔 -1 overall
This message was automatically generated. |
|
|
||
| // Regions Recovery based on high storeFileRefCount threshold value | ||
| public static final String STORE_FILE_REF_COUNT_THRESHOLD = | ||
| "hbase.regions.recovery.store.file.count"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You missed ref from the config name. Its not store file count threshold but the ref count threshold
| </description> | ||
| </property> | ||
| <property> | ||
| <name>hbase.regions.recovery.store.file.count</name> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya change here too
| choreService.cancelChore(this.replicationBarrierCleaner); | ||
| choreService.cancelChore(this.snapshotCleanerChore); | ||
| choreService.cancelChore(this.hbckChore); | ||
| choreService.cancelChore(this.regionsRecoveryChore); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the chore was not scheduled, this cancel call wont make any issue right? Just confirming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it won't make any issues, I confirmed it
| </property> | ||
| <property> | ||
| <name>hbase.regions.recovery.store.file.count</name> | ||
| <value>-1</value> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should allow users to change this with out restarting the HM. But on a followup issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean similar to _switch command?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No.. Via ConfigurationObserver way and allow changes to be impacted with out restart of process
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, looking forward to it on follow up Jira
| // count of all store files | ||
| final int maxStoreFileRefCount = regionMetrics.getMaxStoreFileRefCount(); | ||
| // ignore store file ref count threshold <= 0 (default is -1 i.e. disabled) | ||
| if (storeFileRefCountThreshold > 0 && maxStoreFileRefCount > storeFileRefCountThreshold) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
storeFileRefCountThreshold > 0 check is a redundant one here and can be avoided?
|
|
||
| // Specify specific regions of a table to reopen. | ||
| // if specified null, all regions of the table will be reopened. | ||
| private final List<byte[]> regionNamesList; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pls avoid List, Map etc from the variable name if possible.
| } | ||
| regions = | ||
| env.getAssignmentManager().getRegionStates().getRegionsOfTableForReopen(tableName); | ||
| List<HRegionLocation> tableRegionsForReopen = env.getAssignmentManager() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be all the regions of this table right? Better name for the var will be tableRegions.
| env.getAssignmentManager().getRegionStates().getRegionsOfTableForReopen(tableName); | ||
| List<HRegionLocation> tableRegionsForReopen = env.getAssignmentManager() | ||
| .getRegionStates().getRegionsOfTableForReopen(tableName); | ||
| regions = prepareRegionsForReopen(tableRegionsForReopen); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of the method is not self explaining. What prepare it is doing? This basically gets the Region locations of the mentioned region names right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can even pass the regionNamesList also here.
|
|
||
|
|
||
| [[hbase.regions.recovery.store.file.count]] | ||
| *`hbase.regions.recovery.store.file.count`*:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here also the config name to be corrected.
| + | ||
| .Description | ||
|
|
||
| Store files Ref Count threshold value considered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can say bit more detail here. Very large ref count on a file indicates that its a ref leak on that object. Such files can not get removed even after its invalidation via compaction or so. Only way to come out of such situation is the reopen of the region
|
Updated the PR based on latest comments. Please review @anoopsjohn |
5cc5910 to
9fe22e2
Compare
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
9fe22e2 to
dbe98f3
Compare
|
💔 -1 overall
This message was automatically generated. |
|
@apurtell @anoopsjohn The status on this so far:
Thanks |
anoopsjohn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
|
@apurtell Please review as per your convenience |
|
💔 -1 overall
This message was automatically generated. |
Jira: https://issues.apache.org/jira/browse/HBASE-22460